This report explores the relationship between weather patterns and crime rates in Colchester during 2024. The central research question was: which season was the safest in Colchester? This leads to broader inquiries into whether crime patterns changed in response to fluctuations in weather conditions, and whether particular types of crime were more likely to occur in each season.
To investigate these questions, two data sets were used. The first, crime24.csv, contains crime data for Colchester throughout 2024, detailing the nature, location, and timing of reported incidents. The second, temp24.csv, provides daily meteorological data from a weather station, capturing variables such as temperature, rainfall, humidity, and sunshine hours.
Before any analysis of the data, data pre-processing steps were carried out to ensure both data sets were clean, consistent, and properly formatted. This involved handling missing values, parsing dates, generating season-based groupings, and aggregating observations for meaningful comparisons. Once prepared, the data sets were merged using a common variable (year_month), enabling an integrated view of weather and crime in Colchester.
The analysis incorporated a variety of visual and statistical methods to explore potential relationships between weather conditions and crime. These included summary tables, bar and density plots, violin plots, scatterplots, correlation matrices, time series visualisations with smoothing, and an interactive spatial map. These tools provided an evidence-based foundation for interpreting the temporal and spatial dynamics of crime in Colchester.
The data preparation stage began with the installation and loading of several R packages required for the analysis. These libraries included tidyverse, lubridate, ggplot2, plotly, and leaflet, among others, each providing essential functions for data manipulation, visualisation, and interactivity.
Following this, the two data sets crime24.csv and temp24.csv were imported into the R environment and previewed. This initial exploration provided insight into the structure and contents of each data set and guided the necessary data pre-processing steps.
It was observed that both data sets contained a number of irrelevant or redundant columns, which were subsequently removed to streamline the analysis. Additionally, missing values were present in several variables. In the weather data, missing precipitation and low-level cloud values were imputed using the column median, and in the crime data, missing outcome_status entries were labelled "Unknown". Columns that were entirely or almost entirely missing were dropped rather than imputed, so no rows had to be excluded.
Further cleaning included the parsing and formatting of date variables. The crime data set featured dates in a “YYYY-MM” format, while the weather data set used full “YYYY-MM-DD” dates. To enable merging and season aggregation, both date formats were standardised, and a new variable (year_month) was generated in each data set to represent monthly time intervals. A season variable was also created using the month of each observation to classify entries into Winter, Spring, Summer, or Autumn.
Finally, the cleaned data sets were merged using the year_month variable, resulting in a combined data set that allowed for comparative analysis between monthly crime levels and prevailing weather conditions. This merged data set formed the foundation for some subsequent visualisations and statistical interpretations.
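#load packages used by the code in this section
library(tidyverse) # dplyr, tidyr, ggplot2
library(lubridate) # ym()
library(plotly) # ggplotly()
#load data sets
crime <- read.csv("crime24.csv")
temp <- read.csv("temp24.csv")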
# Preview data sets
head(crime)
## X category persistent_id date lat long street_id
## 1 1 anti-social-behaviour 2024-01 51.89301 0.901028 2153130
## 2 2 anti-social-behaviour 2024-01 51.88979 0.898830 2153105
## 3 3 anti-social-behaviour 2024-01 51.89825 0.902107 2153147
## 4 4 anti-social-behaviour 2024-01 51.87837 0.888373 2152856
## 5 5 anti-social-behaviour 2024-01 51.87905 0.889521 2152871
## 6 6 anti-social-behaviour 2024-01 51.88860 0.899203 2153107
## street_name context id location_type
## 1 On or near Middle Mill NA 115967607 Force
## 2 On or near Conference/exhibition Centre NA 115967129 Force
## 3 On or near Mason Road NA 115967591 Force
## 4 On or near Kensington Road NA 115967062 Force
## 5 On or near Lambeth Road NA 115967058 Force
## 6 On or near Trinity Street NA 115967547 Force
## location_subtype outcome_status
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
head(temp)
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1 3590 2024-12-31 6.5 7.7 5.0 4.4
## 2 3590 2024-12-30 5.6 6.9 3.4 4.9
## 3 3590 2024-12-29 3.3 4.9 2.2 3.2
## 4 3590 2024-12-28 4.0 5.8 2.3 3.7
## 5 3590 2024-12-27 5.3 6.7 4.3 5.1
## 6 3590 2024-12-26 6.7 10.0 5.6 6.4
## HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1 86.4 WSW 22.7 42.6 1025.3 0.0 4.5 7.2
## 2 94.9 WSW 16.7 40.8 1028.5 0.0 8.0 8.0
## 3 98.6 W 11.4 22.2 1028.5 0.4 8.0 8.0
## 4 98.4 SW 5.5 14.8 1031.8 0.4 8.0 8.0
## 5 98.4 S 6.3 16.7 1034.7 0.4 8.0 8.0
## 6 98.3 WSW 9.3 22.2 1033.6 0.4 8.0 8.0
## SunD1h VisKm SnowDepcm PreselevHp
## 1 5.7 63.4 NA NA
## 2 0.0 15.3 NA NA
## 3 0.0 0.5 NA NA
## 4 0.0 0.1 NA NA
## 5 0.0 0.5 NA NA
## 6 0.0 0.2 NA NA
dim(crime)
## [1] 6304 13
dim(temp)
## [1] 366 18
The preview of data revealed there are 6304 rows and 13 columns in the crime data, while there are 366 rows and 18 columns in the temp data.
The variables in the crime data set are:
X: row indexes without variable title
category: Category of the crime (https://data.police.uk/docs/method/crime-street/)
persistent_id: 64-character unique identifier for that crime. (This is different to the existing ‘id’ attribute, which is not guaranteed to always stay the same for each crime.)
date: Date of the crime in format: YYYY-MM
lat: Latitude coordinate
long: Longitude coordinate
street_id: Unique identifier for the street
street_name: Name of the location. An approximation of where the crime happened
context: Extra information about the crime (if applicable)
id: ID of the crime. This ID only relates to the API, it is NOT a police identifier
location_type: The type of the location. Either Force or BTP: Force indicates a normal police force location; BTP indicates a British Transport Police location. BTP locations fall within normal police force boundaries.
location_subtype: For BTP locations, the type of location at which this crime was recorded.
outcome_status: The category and date of the latest recorded outcome for the crime
The variables in the temp data set are:
station_ID - WMO station identifier
Date - date (and time) of observations. Format: YYYY-MM-DD
VisKm - visibility in kilometres
TemperatureCAvg - average air temperature at 2 metres above ground level. Values given in Celsius degrees
TemperatureCMax - maximum air temperature at 2 metres above ground level. Values given in Celsius degrees
TemperatureCMin - minimum air temperature at 2 metres above ground level. Values given in Celsius degrees
TdAvgC - average dew point temperature at 2 metres above ground level. Values given in Celsius degrees
HrAvg - average relative humidity. Values given in %
WindkmhDir - wind direction
WindkmhInt - wind speed in km/h
WindkmhGust - wind gust in km/h
PresslevHp - Sea level pressure in hPa
Precmm - precipitation totals in mm
TotClOct - total cloudiness in oktas (eighths of the sky; labelled "octants" in the source data)
lowClOct - cloudiness by low level clouds in oktas
SunD1h - sunshine duration in hours
PreselevHp - atmospheric pressure measured at altitude of station in hPa
SnowDepcm - depth of snow cover in centimetres
#output data type and structure of data sets
str(crime)
## 'data.frame': 6304 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ category : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr "" "" "" "" ...
## $ date : chr "2024-01" "2024-01" "2024-01" "2024-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.901 0.899 0.902 0.888 0.89 ...
## $ street_id : int 2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
## $ street_name : chr "On or near Middle Mill" "On or near Conference/exhibition Centre" "On or near Mason Road" "On or near Kensington Road" ...
## $ context : logi NA NA NA NA NA NA ...
## $ id : int 115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
## $ location_type : chr "Force" "Force" "Force" "Force" ...
## $ location_subtype: chr "" "" "" "" ...
## $ outcome_status : chr NA NA NA NA ...
str(temp)
## 'data.frame': 366 obs. of 18 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : chr "2024-12-31" "2024-12-30" "2024-12-29" "2024-12-28" ...
## $ TemperatureCAvg: num 6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
## $ TemperatureCMax: num 7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
## $ TemperatureCMin: num 5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
## $ TdAvgC : num 4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
## $ HrAvg : num 86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
## $ WindkmhDir : chr "WSW" "WSW" "W" "SW" ...
## $ WindkmhInt : num 22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
## $ WindkmhGust : num 42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
## $ PresslevHp : num 1025 1028 1028 1032 1035 ...
## $ Precmm : num 0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
## $ TotClOct : num 4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
## $ lowClOct : num 7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
## $ SunD1h : num 5.7 0 0 0 0 0 0 1.4 2.8 0 ...
## $ VisKm : num 63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
## $ SnowDepcm : int NA NA NA NA NA NA NA NA NA NA ...
## $ PreselevHp : logi NA NA NA NA NA NA ...
To streamline the data sets and retain only relevant variables for analysis, several columns were removed. From the crime data, the columns X, context, location_subtype, and persistent_id were deleted. The X column contained row indices, which were not necessary for analysis. The context column consisted entirely of missing values (6,304 NAs), offering no usable information. The location_subtype column applies only to BTP locations and held empty strings in the previewed rows, so it added nothing to this analysis. Similarly, the persistent_id column held either empty strings or long alphanumeric identifiers that were not needed, because the id column already identifies each record.
From the weather data set, the columns PreselevHp and SnowDepcm were excluded. PreselevHp was missing for all 366 days and SnowDepcm for 364 of them, so neither could contribute meaningfully to the analysis, especially as snowfall levels in Colchester are often minimal or inconsistent.
Following this structural clean-up, the remaining missing values were handled. In the crime data, the 710 missing outcome_status entries were labelled "Unknown". In the weather data, the 24 missing Precmm values and the 5 missing lowClOct values were replaced with the median of the corresponding column. This approach preserved the data set’s size, and the median is less sensitive than the mean to outliers such as days of very heavy rain.
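The difference between mean and median substitution matters most for a skewed variable such as daily rainfall. The following is a minimal sketch on a toy vector, not the real temp24.csv values; `impute_with` is a hypothetical helper introduced only for illustration.

```r
# Toy rainfall values in mm (hypothetical, chosen to be right-skewed)
rain <- c(0, 0, 0.2, 0.4, NA, 1.6, 12.0, NA, 0, 38.0)

# Hypothetical helper: replace NAs with a summary statistic
# computed from the observed (non-missing) values
impute_with <- function(x, stat) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

rain_mean   <- impute_with(rain, mean)    # pulled upwards by the two wet days
rain_median <- impute_with(rain, median)  # stays close to a typical dry day

# Compare the value that each method writes into the gaps
c(mean_fill = rain_mean[5], median_fill = rain_median[5])
```

On data shaped like this, the mean fills each gap with a value far above a typical day, while the median stays near the bulk of the observations.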
The date formats in both data sets were also standardised. The crime data’s date variable, originally formatted as “YYYY-MM”, was parsed to generate a full date representation (“YYYY-MM-01”) for consistency. The weather data’s Date variable, already in “YYYY-MM-DD” format, was converted to a proper Date class in R. From these, a new variable—year_month—was derived in both data sets to enable monthly aggregation. Additionally, a season variable was added by mapping each observation’s month to one of the four meteorological seasons: Winter (December–February), Spring (March–May), Summer (June–August), and Autumn (September–November).
These cleaning and formatting steps ensured both data sets were aligned, enabling accurate merging and visual exploration in the subsequent analysis stages.
#basic statistics
summary(crime)
## X category persistent_id date
## Min. : 1 Length:6304 Length:6304 Length:6304
## 1st Qu.:1577 Class :character Class :character Class :character
## Median :3152 Mode :character Mode :character Mode :character
## Mean :3152
## 3rd Qu.:4728
## Max. :6304
## lat long street_id street_name
## Min. :51.88 Min. :0.8788 Min. :2152686 Length:6304
## 1st Qu.:51.89 1st Qu.:0.8966 1st Qu.:2153025 Class :character
## Median :51.89 Median :0.9013 Median :2153155 Mode :character
## Mean :51.89 Mean :0.9029 Mean :2153873
## 3rd Qu.:51.89 3rd Qu.:0.9088 3rd Qu.:2153366
## Max. :51.90 Max. :0.9246 Max. :2343256
## context id location_type location_subtype
## Mode:logical Min. :115954844 Length:6304 Length:6304
## NA's:6304 1st Qu.:118009952 Class :character Class :character
## Median :120228058 Mode :character Mode :character
## Mean :120403000
## 3rd Qu.:122339060
## Max. :125550731
## outcome_status
## Length:6304
## Class :character
## Mode :character
##
##
##
summary(temp)
## station_ID Date TemperatureCAvg TemperatureCMax
## Min. :3590 Length:366 Min. :-2.60 Min. : 1.10
## 1st Qu.:3590 Class :character 1st Qu.: 7.00 1st Qu.:10.72
## Median :3590 Mode :character Median :10.95 Median :14.75
## Mean :3590 Mean :10.98 Mean :15.08
## 3rd Qu.:3590 3rd Qu.:14.50 3rd Qu.:19.60
## Max. :3590 Max. :23.10 Max. :29.80
##
## TemperatureCMin TdAvgC HrAvg WindkmhDir
## Min. :-6.100 Min. :-6.000 Min. :59.60 Length:366
## 1st Qu.: 3.325 1st Qu.: 4.725 1st Qu.:75.90 Class :character
## Median : 6.800 Median : 8.200 Median :82.75 Mode :character
## Mean : 6.486 Mean : 7.752 Mean :81.74
## 3rd Qu.: 9.500 3rd Qu.:11.000 3rd Qu.:88.80
## Max. :16.700 Max. :16.900 Max. :98.60
##
## WindkmhInt WindkmhGust PresslevHp Precmm
## Min. : 3.90 Min. : 11.10 Min. : 978.9 Min. : 0.000
## 1st Qu.:12.22 1st Qu.: 31.50 1st Qu.:1007.5 1st Qu.: 0.000
## Median :15.80 Median : 38.90 Median :1013.8 Median : 0.200
## Mean :16.52 Mean : 40.81 Mean :1013.7 Mean : 1.864
## 3rd Qu.:19.80 3rd Qu.: 48.20 3rd Qu.:1021.0 3rd Qu.: 1.600
## Max. :42.50 Max. :105.60 Max. :1037.3 Max. :38.000
## NA's :24
## TotClOct lowClOct SunD1h VisKm
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.10
## 1st Qu.:3.800 1st Qu.:5.800 1st Qu.: 0.325 1st Qu.:20.73
## Median :5.600 Median :6.900 Median : 3.500 Median :30.95
## Mean :5.304 Mean :6.609 Mean : 4.203 Mean :31.42
## 3rd Qu.:7.200 3rd Qu.:7.600 3rd Qu.: 7.100 3rd Qu.:41.20
## Max. :8.000 Max. :8.000 Max. :15.600 Max. :71.20
## NA's :5
## SnowDepcm PreselevHp
## Min. :1.00 Mode:logical
## 1st Qu.:1.25 NA's:366
## Median :1.50
## Mean :1.50
## 3rd Qu.:1.75
## Max. :2.00
## NA's :364
#1.Drop irrelevant columns:
#crime, 13 columns - 4 = 9 columns left
crime<- crime%>%select(-c(X, context, location_subtype, persistent_id))
#temp, 18 columns - 2 = 16 columns left
temp<- temp%>%select(-c(PreselevHp,SnowDepcm))
#check should output 9 and 16
length(crime)
## [1] 9
length(temp)
## [1] 16
#2.Handling Missing values (NAs):
#checking total nas in each column of both data sets
colSums(is.na(crime))
## category date lat long street_id
## 0 0 0 0 0
## street_name id location_type outcome_status
## 0 0 0 710
colSums(is.na(temp))
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin
## 0 0 0 0 0
## TdAvgC HrAvg WindkmhDir WindkmhInt WindkmhGust
## 0 0 0 0 0
## PresslevHp Precmm TotClOct lowClOct SunD1h
## 0 24 0 5 0
## VisKm
## 0
#where the outcome_status==NA, impute Unknown
crime$outcome_status[is.na(crime$outcome_status)]<- "Unknown"
#where the Precmm ==NA, impute column median
temp$Precmm[is.na(temp$Precmm)]<- median(temp$Precmm, na.rm=TRUE)
#Where the lowClOct== NA, impute column median
temp$lowClOct[is.na(temp$lowClOct)]<- median(temp$lowClOct, na.rm=TRUE)
#re-check, should be zero
sum(is.na(crime))
## [1] 0
sum(is.na(temp))
## [1] 0
#3.date and Date variable format, and extract month num for season assignment:
#crime, extract month num so that i can assign season:
crime <- crime %>%
#create 4 new columns date_parsed, year_month, month num and season:
mutate(
# Convert to Date format first
date_parsed = as.Date(paste0(date, "-01")), #forcing date parse
#for consistency create year_month variable and assign existing date variable to it. Because crime date is already in YYYY-MM format keep as is:
year_month = format(date),
#extract month num from date:
month_num = as.numeric(format(date_parsed, "%m")),
#assign seasons:
crime_season = case_when(
month_num %in% 3:5 ~ "Spring",
month_num %in% 6:8 ~ "Summer",
month_num %in% 9:11 ~ "Autumn",
TRUE ~ "Winter"))
#temp, extract month num so that i can assign season:
temp <- temp %>%
mutate(
# Convert to Date format first
Date_parsed = as.Date(Date),
# Create YYYY-MM (to match crime data)
year_month = format(Date_parsed, "%Y-%m"),
# Extract month number for seasons
month_num = as.numeric(format(Date_parsed, "%m")),
# Assign seasons (same logic as crime data)
temp_season = case_when(
month_num %in% 3:5 ~ "Spring",
month_num %in% 6:8 ~ "Summer",
month_num %in% 9:11 ~ "Autumn",
TRUE ~ "Winter"))
# Check crime data
crime %>%
select(year_month, month_num, crime_season) %>%
distinct()
## year_month month_num crime_season
## 1 2024-01 1 Winter
## 2 2024-02 2 Winter
## 3 2024-03 3 Spring
## 4 2024-04 4 Spring
## 5 2024-05 5 Spring
## 6 2024-06 6 Summer
## 7 2024-07 7 Summer
## 8 2024-08 8 Summer
## 9 2024-09 9 Autumn
## 10 2024-10 10 Autumn
## 11 2024-11 11 Autumn
## 12 2024-12 12 Winter
# Check temp data
temp %>%
select(year_month,month_num, temp_season) %>%
distinct()
## year_month month_num temp_season
## 1 2024-12 12 Winter
## 2 2024-11 11 Autumn
## 3 2024-10 10 Autumn
## 4 2024-09 9 Autumn
## 5 2024-08 8 Summer
## 6 2024-07 7 Summer
## 7 2024-06 6 Summer
## 8 2024-05 5 Spring
## 9 2024-04 4 Spring
## 10 2024-03 3 Spring
## 11 2024-02 2 Winter
## 12 2024-01 1 Winter
#check data types and entries, making sure date parsed in both data sets
glimpse(crime)
## Rows: 6,304
## Columns: 13
## $ category <chr> "anti-social-behaviour", "anti-social-behaviour", "anti…
## $ date <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", …
## $ lat <dbl> 51.89301, 51.88979, 51.89825, 51.87837, 51.87905, 51.88…
## $ long <dbl> 0.901028, 0.898830, 0.902107, 0.888373, 0.889521, 0.899…
## $ street_id <int> 2153130, 2153105, 2153147, 2152856, 2152871, 2153107, 2…
## $ street_name <chr> "On or near Middle Mill", "On or near Conference/exhibi…
## $ id <int> 115967607, 115967129, 115967591, 115967062, 115967058, …
## $ location_type <chr> "Force", "Force", "Force", "Force", "Force", "Force", "…
## $ outcome_status <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", …
## $ date_parsed <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ year_month <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", …
## $ month_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ crime_season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Wint…
glimpse(temp)
## Rows: 366
## Columns: 20
## $ station_ID <int> 3590, 3590, 3590, 3590, 3590, 3590, 3590, 3590, 3590, …
## $ Date <chr> "2024-12-31", "2024-12-30", "2024-12-29", "2024-12-28"…
## $ TemperatureCAvg <dbl> 6.5, 5.6, 3.3, 4.0, 5.3, 6.7, 9.4, 4.3, 4.6, 7.2, 4.7,…
## $ TemperatureCMax <dbl> 7.7, 6.9, 4.9, 5.8, 6.7, 10.0, 12.3, 6.9, 7.9, 11.0, 7…
## $ TemperatureCMin <dbl> 5.0, 3.4, 2.2, 2.3, 4.3, 5.6, 3.5, 2.5, 2.5, 3.3, 0.3,…
## $ TdAvgC <dbl> 4.4, 4.9, 3.2, 3.7, 5.1, 6.4, 8.8, 1.8, -0.5, 4.5, 3.4…
## $ HrAvg <dbl> 86.4, 94.9, 98.6, 98.4, 98.4, 98.3, 95.6, 84.2, 70.0, …
## $ WindkmhDir <chr> "WSW", "WSW", "W", "SW", "S", "WSW", "W", "W", "WNW", …
## $ WindkmhInt <dbl> 22.7, 16.7, 11.4, 5.5, 6.3, 9.3, 15.4, 16.4, 36.8, 28.…
## $ WindkmhGust <dbl> 42.6, 40.8, 22.2, 14.8, 16.7, 22.2, 31.5, 50.0, 70.4, …
## $ PresslevHp <dbl> 1025.3, 1028.5, 1028.5, 1031.8, 1034.7, 1033.6, 1026.9…
## $ Precmm <dbl> 0.0, 0.0, 0.4, 0.4, 0.4, 0.4, 0.0, 0.0, 0.8, 0.8, 1.0,…
## $ TotClOct <dbl> 4.5, 8.0, 8.0, 8.0, 8.0, 8.0, 6.8, 6.7, 4.3, 6.6, 4.6,…
## $ lowClOct <dbl> 7.2, 8.0, 8.0, 8.0, 8.0, 8.0, 6.8, 7.6, 5.2, 6.9, 5.5,…
## $ SunD1h <dbl> 5.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.4, 2.8, 0.0, 0.3,…
## $ VisKm <dbl> 63.4, 15.3, 0.5, 0.1, 0.5, 0.2, 13.3, 20.0, 38.8, 34.9…
## $ Date_parsed <date> 2024-12-31, 2024-12-30, 2024-12-29, 2024-12-28, 2024-…
## $ year_month <chr> "2024-12", "2024-12", "2024-12", "2024-12", "2024-12",…
## $ month_num <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ temp_season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Win…
Note: temp_season and crime_season are interchangeable. Both are derived from month_num using identical month-to-season rules, and both data sets cover the same 12 months of 2024.
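The claim that the two season variables agree can be checked rather than assumed. The sketch below rebuilds the month-to-season rule on stand-in keys for 2024 using base R only; `month_to_season`, `crime_lookup` and `temp_lookup` are hypothetical names, not objects from the analysis.

```r
# Hypothetical helper mirroring the month-to-season rule used in this report
month_to_season <- function(month_num) {
  ifelse(month_num %in% 3:5, "Spring",
  ifelse(month_num %in% 6:8, "Summer",
  ifelse(month_num %in% 9:11, "Autumn", "Winter")))
}

# Stand-ins for the two date columns: monthly keys and daily dates for 2024
crime_months <- sprintf("2024-%02d", 1:12)
temp_dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")

crime_lookup <- data.frame(
  year_month = crime_months,
  crime_season = month_to_season(as.numeric(substr(crime_months, 6, 7)))
)
temp_lookup <- unique(data.frame(
  year_month = format(temp_dates, "%Y-%m"),
  temp_season = month_to_season(as.numeric(format(temp_dates, "%m")))
))

# One row per month after merging, and the two season labels must agree
both <- merge(crime_lookup, temp_lookup, by = "year_month")
stopifnot(nrow(both) == 12, all(both$crime_season == both$temp_season))
```

The same `stopifnot()` line could be run on the real data after joining the distinct year_month and season columns of crime and temp.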
#one way TABLE: crime counts per category in descending order
crime_table <- crime%>%
count(category, sort=TRUE)
knitr::kable(crime_table, caption = "Crime category counts in Colchester (2024)")
| category | n |
|---|---|
| violent-crime | 2420 |
| anti-social-behaviour | 710 |
| shoplifting | 629 |
| criminal-damage-arson | 479 |
| public-order | 458 |
| other-theft | 412 |
| vehicle-crime | 270 |
| drugs | 265 |
| burglary | 171 |
| bicycle-theft | 149 |
| other-crime | 100 |
| theft-from-the-person | 91 |
| robbery | 85 |
| possession-of-weapons | 65 |
#one way TABLE: weather averages by season
temp_table <- temp %>%
group_by(temp_season) %>%
summarise(
"avg_temp (°C)" = round(mean(TemperatureCAvg),2), #seasonal mean of daily average temperature
"max_temp (°C)" = round(mean(TemperatureCMax),2), #seasonal mean of daily maximum temperature
"min_temp (°C)" = round(mean(TemperatureCMin),2), #seasonal mean of daily minimum temperature
"total_rainfall (mm)" = sum(Precmm), #total rainfall per season
"avg_sunshine (hours)" = round(mean(SunD1h),2), #avg daily sunshine in hours
.groups = "drop") #return an ungrouped data frame
knitr::kable(temp_table, caption = "Weather Averages by season (2024)")
| temp_season | avg_temp (°C) | max_temp (°C) | min_temp (°C) | total_rainfall (mm) | avg_sunshine (hours) |
|---|---|---|---|---|---|
| Autumn | 11.22 | 15.02 | 7.18 | 138.6 | 3.32 |
| Spring | 10.22 | 14.38 | 5.86 | 194.2 | 4.65 |
| Summer | 16.33 | 21.67 | 10.09 | 122.0 | 6.95 |
| Winter | 6.12 | 9.18 | 2.78 | 187.4 | 1.86 |
#TWO WAY TABLE OF crime category vs temp_season:
#answers, question: in what season does x crime occur?
#which crimes occur in each season?
#which season had the most crime?
two_way_table <- crime%>%
count(category, crime_season, name= "n")%>%
#pivot to wider format for seasons to be columns & crime count for each season in rows
pivot_wider(names_from = crime_season, values_from = n)%>%
#add row total for category
mutate(
Total = rowSums(across(where(is.numeric))))%>%
#add column total for seasons (sum of crimes per season)
bind_rows(
summarise(.,
category= "Seasons_Totals",
across(where(is.numeric), sum)))%>% #Sum all numeric column
arrange(desc(Total)) #sort by most frequent crime category
knitr::kable(two_way_table, caption = "Crime Category counts by Season in Colchester (2024)")
| category | Autumn | Spring | Summer | Winter | Total |
|---|---|---|---|---|---|
| Seasons_Totals | 1565 | 1541 | 1631 | 1567 | 6304 |
| violent-crime | 595 | 565 | 618 | 642 | 2420 |
| anti-social-behaviour | 170 | 216 | 174 | 150 | 710 |
| shoplifting | 185 | 149 | 137 | 158 | 629 |
| criminal-damage-arson | 96 | 143 | 134 | 106 | 479 |
| public-order | 112 | 99 | 144 | 103 | 458 |
| other-theft | 100 | 110 | 102 | 100 | 412 |
| vehicle-crime | 57 | 47 | 108 | 58 | 270 |
| drugs | 65 | 66 | 48 | 86 | 265 |
| burglary | 50 | 34 | 43 | 44 | 171 |
| bicycle-theft | 60 | 25 | 30 | 34 | 149 |
| other-crime | 22 | 34 | 22 | 22 | 100 |
| theft-from-the-person | 19 | 19 | 27 | 26 | 91 |
| robbery | 24 | 16 | 26 | 19 | 85 |
| possession-of-weapons | 10 | 18 | 18 | 19 | 65 |
#TWO WAY TABLE: crime category counts per month
tw_table <- crime%>%
count(category, month_num, name= "m")%>%
#pivot to wider format so months become columns; values_fill = 0 turns
#category-month combinations with no recorded crimes into 0 instead of NA
pivot_wider(names_from = month_num, values_from = m, values_fill = 0)%>%
#add row total for category
mutate(
Total = rowSums(across(where(is.numeric))))%>%
arrange(desc(Total)) #sort by most frequent crime category
knitr::kable(tw_table, caption = "Crime Category counts by month in Colchester (2024)")
| category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| violent-crime | 213 | 220 | 188 | 163 | 214 | 192 | 242 | 184 | 223 | 195 | 177 | 209 | 2420 |
| anti-social-behaviour | 42 | 64 | 66 | 70 | 80 | 63 | 53 | 58 | 58 | 56 | 56 | 44 | 710 |
| shoplifting | 48 | 49 | 50 | 40 | 59 | 42 | 58 | 37 | 47 | 64 | 74 | 61 | 629 |
| criminal-damage-arson | 33 | 46 | 37 | 43 | 63 | 44 | 51 | 39 | 33 | 33 | 30 | 27 | 479 |
| public-order | 43 | 36 | 34 | 33 | 32 | 42 | 49 | 53 | 39 | 37 | 36 | 24 | 458 |
| other-theft | 34 | 30 | 35 | 34 | 41 | 34 | 33 | 35 | 32 | 38 | 30 | 36 | 412 |
| vehicle-crime | 16 | 29 | 20 | 14 | 13 | 15 | 41 | 52 | 17 | 27 | 13 | 13 | 270 |
| drugs | 28 | 24 | 29 | 25 | 12 | 12 | 17 | 19 | 25 | 21 | 19 | 34 | 265 |
| burglary | 19 | 15 | 11 | 10 | 13 | 9 | 18 | 16 | 8 | 17 | 25 | 10 | 171 |
| bicycle-theft | 11 | 8 | 7 | 12 | 6 | 9 | 12 | 9 | 12 | 19 | 29 | 15 | 149 |
| other-crime | 11 | 5 | 12 | 10 | 12 | 6 | 7 | 9 | 6 | 12 | 4 | 6 | 100 |
| theft-from-the-person | 11 | 6 | 5 | 6 | 8 | 7 | 12 | 8 | 4 | 7 | 8 | 9 | 91 |
| robbery | 11 | 8 | 3 | 6 | 7 | 9 | 10 | 7 | 10 | 8 | 6 | 0 | 85 |
| possession-of-weapons | 9 | 6 | 5 | 5 | 8 | 6 | 5 | 7 | 5 | 3 | 2 | 4 | 65 |
Observations:
First table: Crime category counts in Colchester (2024)
Violent crime recorded 2,420 cases.
Anti-social behaviour accounted for 710 cases.
Possession of weapons had the lowest count with 65 cases.
Second table: Weather Averages by season (2024)
Winter showed the lowest average temperature (6.12 °C).
Summer showed the highest mean daily maximum temperature (21.67 °C).
Spring had the most rainfall (194.2 mm).
Autumn had the second-lowest average daily sunshine (3.32 hours), after Winter (1.86 hours).
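Third table: Crime Category counts by Season in Colchester (2024)
Summer recorded the most crimes (1,631) and Spring the fewest (1,541); Autumn (1,565) and Winter (1,567) were almost identical.
Violent crime was the largest category in every season, peaking in Winter (642).
Vehicle crime in Summer (108) was nearly double its count in any other season.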
Fourth table: Crime Category counts by month in Colchester (2024)
This table gives a more detailed view of how often each crime type occurred in Colchester in 2024, month by month.
No robberies were recorded in December 2024; that category and month combination is absent from the data.
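Because the seasons differ slightly in length, the seasonal totals can also be expressed as crimes per day. The sketch below takes the season totals from the third table and the month lengths of 2024 (a leap year); the variable names are illustrative only.

```r
# Season totals copied from the "Seasons_Totals" row of the third table
season_totals <- c(Autumn = 1565, Spring = 1541, Summer = 1631, Winter = 1567)

# Days per month in 2024 (leap year) and the season each month belongs to
days_in_month <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
season_of_month <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer", "Autumn", "Autumn",
                     "Autumn", "Winter")

# Named vector of days per season
season_days <- sapply(split(days_in_month, season_of_month), sum)

# Crimes per day, aligned by season name
crimes_per_day <- round(season_totals / season_days[names(season_totals)], 2)
crimes_per_day
```

Normalising by days leaves the ranking unchanged: Spring still has the lowest rate and Summer the highest.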
#crime counts per category bar plot
gg_crime_bar <- crime %>%
count(category) %>%
ggplot(aes(x = reorder(category, n), y = n, fill = category))+
geom_col(alpha = 0.9, show.legend = FALSE) + # alpha is set in the geom, where it takes effect; legend removed
coord_flip() + # Horizontal bars
labs(title = "Crime Frequency by category in Colchester 2024", x = "Crime Category", y = "Frequency")+
theme_minimal()
#Interactive plot
plotly_crime_bar <- ggplotly(gg_crime_bar)%>% layout(showlegend = FALSE)
plotly_crime_bar
Observation for bar plot: Crime Frequency by category in Colchester 2024
Violent crime occurred most frequently.
Possession of weapons occurred least frequently.
#Temperature distribution
avg_temp_hist <- ggplot(temp, aes(x = TemperatureCAvg)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Temperature Distribution in Colchester (2024)", x = "Temperature (°C)")+
theme_minimal()
ggplotly(avg_temp_hist)
#Max Temperature distribution
max_temp_hist <- ggplot(temp, aes(x = TemperatureCMax)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Maximum Temperature Distribution in Colchester (2024)", x = "Maximum Temperature (°C)")+
theme_minimal()
ggplotly(max_temp_hist)
#Min Temperature distribution
min_temp_hist <- ggplot(temp, aes(x = TemperatureCMin)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Minimum Temperature Distribution in Colchester (2024)", x = "Minimum Temperature (°C)")+
theme_minimal()
ggplotly(min_temp_hist)
#TdAvgC, (average dew point Temperature) distribution
TdAvgC_hist <- ggplot(temp, aes(x = TdAvgC)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Average Dew point Temperature Distribution in Colchester (2024)", x = "Average Dew point Temperature (°C)")+
theme_minimal()
ggplotly(TdAvgC_hist)
#HrAvg - average relative humidity. Values given in %
HrAvg_hist <- ggplot(temp, aes(x = HrAvg)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Average Relative Humidity Distribution in Colchester (2024)", x = "Average Relative Humidity ( %)")+
theme_minimal()
ggplotly(HrAvg_hist)
#VisKm - visibility in kilometres
Viskm_hist <- ggplot(temp, aes(x = VisKm)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Visibility Distribution in Colchester (2024)", x = "Visibility (km)")+
theme_minimal()
ggplotly(Viskm_hist)
#WindkmhInt - wind speed in km/h
WindkmhInt_hist <- ggplot(temp, aes(x = WindkmhInt)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Wind Speed Distribution in Colchester (2024)", x = "Wind Speed (km/h)")+
theme_minimal()
ggplotly(WindkmhInt_hist)
#WindkmhGust - wind gust in km/h
WindkmhGust_hist <- ggplot(temp, aes(x = WindkmhGust)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Wind Gust Distribution in Colchester (2024)", x = "Wind Gust (km/h)")+
theme_minimal()
ggplotly(WindkmhGust_hist)
#PresslevHp - Sea level pressure in hPa
PresslevHp_hist <- ggplot(temp, aes(x = PresslevHp)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Sea Level Pressure Distribution in Colchester (2024)", x = "Sea Level Pressure (hPa)")+
theme_minimal()
ggplotly(PresslevHp_hist)
Observation of histograms:
Most weather variables followed approximately normal distributions.
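Humidity and pressure were mildly skewed to the left rather than the right: in the summary statistics, each has a mean slightly below its median and a longer lower tail.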
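The direction of skew read off a histogram can be cross-checked numerically. The sketch below defines a simple moment-based skewness in base R and tries it on toy vectors; `skewness_moment` is a hypothetical helper written for this illustration, not a function from any loaded package.

```r
# Hypothetical helper: moment-based sample skewness
# positive = longer upper (right) tail, negative = longer lower (left) tail
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Toy vectors (not the weather data)
right_tailed <- c(1, 1, 2, 2, 3, 10)
left_tailed <- -right_tailed   # mirror image of the vector above
symmetric <- c(1, 2, 3, 4, 5)

sapply(list(right = right_tailed, left = left_tailed, sym = symmetric),
       skewness_moment)
```

Assuming the temp data frame from this report is loaded, `sapply(temp[c("HrAvg", "PresslevHp")], skewness_moment)` would apply the same check to humidity and pressure.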
#Precmm - precipitation totals in mm
rainfall_density <- ggplot(temp, aes(x = Precmm)) +
geom_density(fill = "blue", alpha=0.5) +
labs(title = "Total rainfall Distribution in Colchester (2024)", x = "Precipitation (mm)")+
theme_minimal()
ggplotly(rainfall_density)
#TotClOct - total cloudiness in octants
cloudiness_density <- ggplot(temp, aes(x = TotClOct)) +
geom_density(fill = "blue", alpha=0.5) +
labs(title = "Total Cloudiness Distribution in Colchester (2024)", x = "Cloudiness (octants)")+
theme_minimal()
ggplotly(cloudiness_density)
#lowClOct - cloudiness by low level clouds in octants
low.cloudiness_density <- ggplot(temp, aes(x = lowClOct)) +
geom_density(fill = "blue", alpha=0.5) +
labs(title = "Cloudiness by low level clouds Distribution in Colchester (2024)", x = "Cloudiness by low level clouds (octants)")+
theme_minimal()
ggplotly(low.cloudiness_density)
#SunD1h - sunshine duration in hours
sun_density <- ggplot(temp, aes(x = SunD1h)) +
geom_density( fill = "blue", alpha=0.5) +
labs(title = "Sun Distribution in Colchester (2024)", x = "Sun (hours)")+
theme_minimal()
ggplotly(sun_density)
Observations for density plots:
Rainfall density peaked near 0 mm, indicating mostly dry days.
Cloudiness density increased with higher coverage.
Sunshine density peaked near zero, reflecting many overcast days.
#merge data sets:
# 1. Summarize crime data per month
crime_monthly <- crime %>%
group_by(year_month, crime_season, category) %>%
summarise(
crime_count = n(),.groups = "drop")
# 2. Summarize temperature data per month
temp_monthly <- temp %>%
group_by(year_month, temp_season) %>%
summarise(
avg_temp = mean(TemperatureCAvg),
max_temp = mean(TemperatureCMax),
min_temp = mean(TemperatureCMin),
rainfall = mean(Precmm),
sunshine = mean(SunD1h),
.groups = "drop")
# 3. Join both summaries
colchester_monthly <- left_join(crime_monthly, temp_monthly, by = "year_month")
# Check
glimpse(colchester_monthly)
## Rows: 167
## Columns: 10
## $ year_month <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", "2…
## $ crime_season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Winter…
## $ category <chr> "anti-social-behaviour", "bicycle-theft", "burglary", "cr…
## $ crime_count <int> 42, 11, 19, 33, 28, 11, 34, 9, 43, 11, 48, 11, 16, 213, 6…
## $ temp_season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Winter…
## $ avg_temp <dbl> 4.251613, 4.251613, 4.251613, 4.251613, 4.251613, 4.25161…
## $ max_temp <dbl> 7.348387, 7.348387, 7.348387, 7.348387, 7.348387, 7.34838…
## $ min_temp <dbl> 0.7419355, 0.7419355, 0.7419355, 0.7419355, 0.7419355, 0.…
## $ rainfall <dbl> 1.748387, 1.748387, 1.748387, 1.748387, 1.748387, 1.74838…
## $ sunshine <dbl> 2.832258, 2.832258, 2.832258, 2.832258, 2.832258, 2.83225…
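Because the join keeps both `crime_season` and `temp_season`, it can be worth confirming that every crime month found a weather match and that the two season labels agree. The following is a minimal sketch of such a check; the `_demo` tables and their values are hypothetical miniatures of `crime_monthly` and `temp_monthly`.

```r
library(dplyr)

# Hypothetical miniature versions of the two monthly summaries
crime_monthly_demo <- tibble(
  year_month   = c("2024-01", "2024-01", "2024-02"),
  crime_season = c("Winter", "Winter", "Winter"),
  category     = c("burglary", "violent-crime", "burglary"),
  crime_count  = c(10L, 20L, 15L)
)
temp_monthly_demo <- tibble(
  year_month  = c("2024-01", "2024-02"),
  temp_season = c("Winter", "Winter"),
  avg_temp    = c(4.0, 7.0)
)

joined <- left_join(crime_monthly_demo, temp_monthly_demo, by = "year_month")

# Months present in the crime data but missing from the weather data
unmatched_months <- anti_join(crime_monthly_demo, temp_monthly_demo, by = "year_month")

# Rows where the two season labels disagree
season_mismatch <- filter(joined, crime_season != temp_season)

# A left join on a key that is unique in the right table keeps the left row count
stopifnot(nrow(joined) == nrow(crime_monthly_demo))
nrow(unmatched_months)
nrow(season_mismatch)
```

If both counts are zero on the real data, one of the two season columns could be dropped after the join without losing information.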
#colchester_monthly <- crime %>%
#left_join(temp, by = "year_month") # Ensure datasets are merged
# Violin plot: distribution of crime counts by month
crime_sun_violin <- ggplot(colchester_monthly,
    aes(x = factor(format(ym(year_month), "%b"), levels = month.abb),
        y = crime_count, colour = factor(format(ym(year_month), "%b")))) +
  geom_violin(trim = FALSE) +
  labs(title = "Crime Distribution by Month in Colchester 2024",
       x = "Month",
       y = "Number of Crimes") +
  stat_summary(fun = median, geom = "point") +  # `fun` replaces the deprecated `fun.y`
  theme_minimal()
# Hide the colour legend ("none" replaces the deprecated FALSE)
crime_sun_violin <- crime_sun_violin + guides(colour = "none")
ggplotly(crime_sun_violin) %>%
layout(hoverlabel = list(bgcolor = "white"),
xaxis = list(title = "Month"),
yaxis = list(title = "crime count"))
Observations for the violin plot, Crime Distribution by Month in Colchester 2024:
Crime was fairly evenly distributed throughout the year.
The autumn and winter months (September, October, November, December and January) showed greater spread, suggesting more variability in crime counts across categories.
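The spread described above can also be expressed numerically. The following is a minimal sketch that ranks months by the interquartile range of their category-level counts; `monthly_demo` and its values are hypothetical stand-ins for `colchester_monthly`.

```r
library(dplyr)

# Hypothetical month-by-category counts; values are illustrative only
monthly_demo <- tibble(
  year_month  = rep(c("2024-01", "2024-06"), each = 4),
  crime_count = c(5L, 10L, 40L, 200L, 8L, 12L, 35L, 120L)
)

spread_by_month <- monthly_demo %>%
  group_by(year_month) %>%
  summarise(
    median_count = median(crime_count),
    iqr_count    = IQR(crime_count),  # robust measure of spread
    sd_count     = sd(crime_count),   # sensitive to a single dominant category
    .groups = "drop"
  ) %>%
  arrange(desc(iqr_count))

spread_by_month
```

The interquartile range is used alongside the standard deviation because one very frequent category can dominate the latter.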
# Scatter plot: sunshine vs rainfall
sun.rain_scatter <- ggplot(temp, aes(x = SunD1h, y = Precmm)) +
  geom_point(alpha = 0.5, color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(title = "Sunshine vs. Rainfall", y = "Rainfall (mm)", x = "Sunshine (hours)")
ggplotly(sun.rain_scatter)
# Scatter plot: max vs min temperature
max.min_scatter<-ggplot(temp, aes(x = TemperatureCMin, y = TemperatureCMax)) +
geom_point(alpha = 0.5, color= "red") +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(title = "Scatter Plot of Daily Min vs Max Temperatures",
x = "Min Temperature (°C)",
y = "Max Temperature (°C)")
ggplotly(max.min_scatter)
# Scatter Plot of sunshine vs Max Temperatures
sun.max_scatter<- ggplot(temp, aes(x = SunD1h, y = TemperatureCMax)) +
geom_point(alpha = 0.5, color= "red") +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(title = "Scatter Plot of Sunshine vs Max Temperatures",
x = "sunshine (hours)",
y = "Max Temperature (°C)")
ggplotly(sun.max_scatter)
Observations for scatter plots: Sunshine vs. Rainfall:
Scatter Plot of Daily Min vs Max Temperatures:
Scatter Plot of Sunshine vs Max Temperatures:
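The three scatter plots can be complemented with a correlation estimate for each pair of variables. The following is a minimal sketch using base R's `cor.test()`; the column names follow `temp`, while `temp_demo` and its values are hypothetical, so the output says nothing about the real Colchester data.

```r
# Hypothetical daily weather values; illustrative only
temp_demo <- data.frame(
  SunD1h          = c(0.0, 1.2, 3.5, 6.8, 9.1, 0.4, 7.7, 2.3),
  Precmm          = c(6.2, 2.0, 0.4, 0.0, 0.0, 3.1, 0.0, 1.0),
  TemperatureCMin = c(2.1, 3.0, 5.5, 9.8, 12.4, 1.7, 11.0, 4.9),
  TemperatureCMax = c(6.0, 8.4, 11.9, 18.2, 22.5, 5.8, 20.1, 10.3)
)

# The same variable pairs as the three scatter plots
pairs_to_test <- list(
  c("SunD1h", "Precmm"),
  c("TemperatureCMin", "TemperatureCMax"),
  c("SunD1h", "TemperatureCMax")
)

# Pearson correlation and p-value for each pair, collected into one table
scatter_cor <- do.call(rbind, lapply(pairs_to_test, function(p) {
  ct <- cor.test(temp_demo[[p[1]]], temp_demo[[p[2]]], method = "pearson")
  data.frame(x = p[1], y = p[2], r = unname(ct$estimate), p_value = ct$p.value)
}))

scatter_cor
```

Running this on `temp` would give the direction and strength of each relationship shown by the fitted lines.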
#correlation analysis between temp variables and crime counts
#library(ggcorrplot)
numeric_cols <- colchester_monthly %>% select(where(is.numeric))
pval.cor<- cor_pmat(numeric_cols)
corrmat<-round(cor(numeric_cols),1)
colch.monthly_cor<- ggcorrplot(corrmat, hc.order = TRUE, type="lower",
p.mat=pval.cor, sig.level= 0.01, insig= "pch",
pch=4, pch.col = "black")
ggplotly(colch.monthly_cor)
Observations for the correlation analysis between weather variables and crime counts:
Red indicates a positive correlation.
White indicates no correlation.
Purple indicates a negative correlation.
Cells marked with “X” are not statistically significant at the chosen significance level (sig.level = 0.01 in the code above).
crime_count shows no strong linear correlation with any of the weather variables.
The temperature variables (avg_temp, min_temp, max_temp) are strongly positively correlated with one another. This is expected, as they are different measures of the same underlying factor (temperature) and naturally move in tandem.
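The correlation matrix is computed on `colchester_monthly`, which has one row per month and crime category, so each month's weather values appear once per category. One alternative, sketched below under that assumption, is to aggregate to a single row per month before correlating. `colchester_demo` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical month-by-category table with repeated weather values
colchester_demo <- tibble(
  year_month  = rep(c("2024-01", "2024-04", "2024-07", "2024-10"), each = 2),
  category    = rep(c("burglary", "violent-crime"), times = 4),
  crime_count = c(20L, 200L, 15L, 160L, 25L, 240L, 18L, 190L),
  avg_temp    = rep(c(4.3, 9.8, 17.6, 11.2), each = 2),
  rainfall    = rep(c(1.7, 1.2, 1.9, 2.4), each = 2)
)

# One row per month: total crimes, with each weather value carried through once
monthly_totals <- colchester_demo %>%
  group_by(year_month) %>%
  summarise(
    total_crimes = sum(crime_count),
    avg_temp     = first(avg_temp),
    rainfall     = first(rainfall),
    .groups = "drop"
  )

cor_totals <- cor(select(monthly_totals, where(is.numeric)))
round(cor_totals, 2)
```

With only twelve months in a year, p-values from such a matrix would rest on a small sample and should be read with caution.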
#count crimes by date
crime_over_time <- crime %>%
count(date_parsed, name = "crime_count") # Count crimes per day
# Create the time series plot
crime_over_time_plot <- ggplot(crime_over_time, aes(x = date_parsed,
y = crime_count)) +
geom_line(color = "steelblue") +
geom_point(color = "red") +
labs(
title = "Daily Crime Incidents in Colchester (2024)",
x = "Date",
y = "Number of Crimes") +
theme_minimal() +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#interactive
ggplotly(crime_over_time_plot)
#time series plot to observe the highest crime category over the year
# Create violent crimes data frame
v_df <- crime %>%
filter(category == "violent-crime") %>% # Filter for violent crimes
group_by(year_month) %>%
summarise(total_v = n(), .groups = 'drop') # Count occurrences
# Prepare variables for time series plotting
v.months <- ym(v_df$year_month) # Convert to date object
v_crimes <- as.numeric(v_df$total_v) # Total violent crimes by year_month to numeric for plot
# Create a data frame for plotting
v_plot_data <- data.frame(months = v.months, crimes = v_crimes)
# Plot using ggplot
v_ts.plot <- ggplot(v_plot_data, aes(x = months, y = crimes)) +
  geom_point(color = "red") +
  geom_line(color = "steelblue") +
  labs(
    title = "Violent Crimes in Colchester in 2024 by Month",
    x = "Month",
    y = "Number of Violent Crimes") +
  theme_minimal() +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # one x-axis label per month
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create interactive plot
v_ts.plotly <- ggplotly(v_ts.plot)
v_ts.plotly
Observation for time series of Violent Crimes in Colchester in 2024 by Month:
July had the highest violent crime count (242 incidents).
April had the lowest (162 incidents).
Violent crime counts were higher during the summer months, while spring and autumn showed lower counts.
However, the pattern is not linear. Several months (for example, May and October) saw declines in violent crime, indicating the presence of other influencing factors, potentially local events, law enforcement interventions, or socioeconomic conditions.
Overall, the data supports the broader narrative that crime rates, particularly violent crime, tend to increase during the summer months, reinforcing the rationale behind seasonal crime analysis in this report.
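The peak and trough months, and the month-to-month changes behind the non-linear pattern, can be extracted directly from a monthly summary shaped like `v_df`. The following is a minimal sketch; `v_demo` and its counts are hypothetical and are not the reported figures.

```r
library(dplyr)

# Hypothetical monthly violent-crime counts; illustrative only
v_demo <- tibble(
  year_month = sprintf("2024-%02d", 1:12),
  total_v    = c(200L, 190L, 185L, 160L, 175L, 220L, 240L, 225L, 210L, 180L, 195L, 205L)
)

peak   <- slice_max(v_demo, total_v, n = 1)  # month with the most incidents
trough <- slice_min(v_demo, total_v, n = 1)  # month with the fewest incidents

# Month-to-month change: a mix of positive and negative values means the
# series does not rise or fall steadily
v_changes <- v_demo %>%
  mutate(change = total_v - lag(total_v))

peak
trough
v_changes
```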
# Visualising crime on a map of Colchester, 2024
crime_map <- crime %>%
  group_by(lat, long, street_name) %>%
  summarise(crimes = n(), .groups = "drop") %>%  # count incidents per location
  # pass the summarised data to leaflet and plot the map
  leaflet() %>%
  addTiles() %>%
  setView(lng = 0.901028, lat = 51.89301, zoom = 12) %>%
  addCircleMarkers(lng = ~long, lat = ~lat, radius = ~crimes * 0.13, color = "red",
                   popup = ~paste(street_name, "<br>Crimes: ", crimes))
crime_map
Observations for the interactive Leaflet map of crime counts and locations:
Crime hotspots included the town centre, mainly on or near shopping areas (230 incidents), the police station (166), and on or near George Street (140), among others.
These areas likely experienced more crime due to higher pedestrian traffic, which increases the opportunity for offences such as theft and robbery.
A potential limitation of this analysis was that the specific types of crime occurring in each location were not explored. Future work could involve spatially filtering crime categories to identify whether certain areas are more prone to specific offences, using tools such as clustered point analysis or location-based filtering in Leaflet.
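The location-based filtering mentioned as future work could look roughly like the following: one marker layer per crime category, each of which can be switched on or off in the map. This is a minimal sketch, not part of the analysis above. The column names follow the `crime` data frame, while `crime_demo`, its coordinates, and the radius scaling are hypothetical.

```r
library(dplyr)
library(leaflet)

# Hypothetical incident-level data; coordinates and categories are illustrative
crime_demo <- tibble(
  lat         = c(51.8890, 51.8895, 51.8921, 51.8921),
  long        = c(0.9010, 0.9032, 0.8975, 0.8975),
  street_name = c("On or near Shopping Area", "On or near Police Station",
                  "On or near George Street", "On or near George Street"),
  category    = c("shoplifting", "violent-crime", "violent-crime", "violent-crime")
)

# Count incidents per location and category
by_place <- crime_demo %>%
  count(lat, long, street_name, category, name = "crimes")

# Add one marker layer per category so each can be toggled separately
crime_layer_map <- leaflet() %>% addTiles()
for (cat_name in unique(by_place$category)) {
  crime_layer_map <- crime_layer_map %>%
    addCircleMarkers(
      data   = filter(by_place, category == cat_name),
      lng    = ~long, lat = ~lat,
      radius = ~4 + crimes,
      color  = "red",
      group  = cat_name,
      popup  = ~paste0(street_name, "<br>", category, ": ", crimes)
    )
}

crime_layer_map <- crime_layer_map %>%
  addLayersControl(
    overlayGroups = unique(by_place$category),
    options = layersControlOptions(collapsed = FALSE)
  )

crime_layer_map
```

Toggling the layers would show whether a hotspot is driven by one category or by several.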
The findings revealed a seasonal pattern in crime rates, with summer emerging as the least safe period in 2024. July had the highest number of violent crime incidents, likely driven by increased social activity and warmer weather. Spring and winter recorded comparatively lower crime levels.
Although strong relationships were identified among the weather variables themselves, no statistically significant linear relationships between weather variables and crime counts were observed. This suggests that any connection between these weather factors and crime may be indirect or non-linear.
Spatial analysis highlighted concentrated crime activity in Colchester’s central areas, indicative of the influence of geographic and demographic factors (for example: high streets, main roads, shopping malls and entertainment venues). These locations tend to attract larger crowds and provide greater anonymity for offenders, making the public more vulnerable to crimes such as theft, assault, or anti-social behaviour.
This analysis indicated that crime in Colchester displayed seasonal variation, with the summer months representing a period of higher crime risk. However, the absence of significant linear correlations between the weather and crime variables suggests that other social and environmental factors may play a role.
Based on the findings, it is recommended that crime prevention efforts focus more intensively on the summer months and on urban hotspots. Further research should consider incorporating socio-economic data and employing non-linear models to capture the underlying drivers of crime more effectively. Additionally, future studies could investigate the types of crimes committed at specific locations to better tailor interventions.
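One way to start on the non-linear modelling recommended here is to compare a linear count model with one that allows a curved temperature effect. The following is a minimal sketch using base R only, assuming monthly totals with one row per month. `monthly_demo`, its values, and the choice of a quasi-Poisson family with a two-degree-of-freedom natural spline are illustrative assumptions, not results from this report.

```r
library(splines)

# Hypothetical monthly totals; illustrative only
monthly_demo <- data.frame(
  total_crimes = c(610, 580, 600, 590, 640, 700, 720, 705, 660, 620, 605, 615),
  avg_temp     = c(4.3, 5.9, 8.1, 9.8, 13.4, 15.9, 17.6, 17.9, 14.8, 11.2, 7.5, 5.1)
)

# Linear effect of temperature on the log of expected crime counts
linear_fit <- glm(total_crimes ~ avg_temp,
                  family = quasipoisson, data = monthly_demo)

# Natural cubic spline lets the temperature effect bend
spline_fit <- glm(total_crimes ~ ns(avg_temp, df = 2),
                  family = quasipoisson, data = monthly_demo)

# F-test comparing the two nested models
anova(linear_fit, spline_fit, test = "F")
```

With twelve observations any such comparison has little power, so several years of data would be needed before drawing conclusions.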